Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-13185][SQL] Reuse Calendar object in DateTimeUtils.StringToDate method to improve performance #11090

Closed
wants to merge 1 commit into from

Conversation

carsonwang
Copy link
Contributor

The java Calendar object is expensive to create. I have a sub query like this SELECT a, b, c FROM table UV WHERE (datediff(UV.visitDate, '1997-01-01')>=0 AND datediff(UV.visitDate, '2015-01-01')<=0))

The table stores visitDate as String type and has 3 billion records. A Calendar object is created every time DateTimeUtils.stringToDate is called. By reusing the Calendar object, I saw about 20 seconds performance improvement for this stage.

@carsonwang
Copy link
Contributor Author

/cc @srowen @rxin

@SparkQA
Copy link

SparkQA commented Feb 5, 2016

Test build #50810 has finished for PR 11090 at commit 5bbbf91.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Feb 5, 2016

This kind of optimization always feels icky to me but it is pretty self contained. Are there more instances of this across the code that we can at least gather and handle in one place?

@carsonwang
Copy link
Contributor Author

Reusing a Calendar object when the method will be called frequently is something recommended by "Effective Java" mentioned here. The more the method is invoked, the more performance improvement we get.

stringToTimestamp has the same problem but it uses different TimeZones instead of only a GMT TimeZone. We need cache the TimeZones in a hash map as I did in #11071. It is a little more involved and there is a case that the Calendar cannot be reused. So I'll keep it as is and make this PR as small as possible for now. If it is worth doing, I'll be happy to submit a new PR to address it.

@srowen
Copy link
Member

srowen commented Feb 6, 2016

I'm not arguing with the optimization, just moaning at the use of ThreadLocal. This one at least isn't likely to cause a memory leak. Yes, I understand that the more use, the better, which is why it'd be great to see if we can do this once to optimize lots of call sites.

However now that I am back in front of the computer I don't see any. Most are in test code. There are some silly calls to Calendar.getInstance().getTime() but these won't hurt.

Yeah, clear() doesn't un-set the time zone right? Which is OK in this method.

Is your 20 seconds improvement total wall-clock time difference over all records -- as in saving 6 nanoseconds per record? although I like optimization that sounds very small.

#11071 doesn't seem to be correct; is that intended to be part of this change?

@carsonwang
Copy link
Contributor Author

@srowen The 20 seconds improvements is the difference of the stage time. i.e. before the patch, the stage runs 1.6 min. With this path it runs 1.2 min. It takes about 1 second to create 1 million Calendar records. So it saves about 1 microsecond per record.

Right, clear() doesn't un-set the time zone.

#11071 also wants to optimize the stringToTimestamp method. In my case, the code doesn't call stringToTimestamp. I just noticed it also creates Calendar and TimeZone objects which may have the similar issue. Anyway, we can fix the issue in stringToDate first as this is more obvious?

@srowen
Copy link
Member

srowen commented Feb 14, 2016

OK, I think this is OK to merge

@srowen
Copy link
Member

srowen commented Feb 14, 2016

Jenkins, retest this please

@SparkQA
Copy link

SparkQA commented Feb 14, 2016

Test build #51269 has finished for PR 11090 at commit 5bbbf91.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Copy link
Contributor

rxin commented Feb 15, 2016

Thanks - I've merged this in master.

@asfgit asfgit closed this in 7cb4d74 Feb 15, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants